Detection of nonverbal vocalizations using Gaussian mixture models: looking for fillers and laughter in conversational speech
Authors
Abstract
In this paper, we analyze acoustic profiles of fillers (i.e., filled pauses, FPs) and laughter with the aim of automatically localizing these nonverbal vocalizations in a stream of audio. Among other features, we use voice quality features to capture the distinctive production modes of laughter, and spectral similarity measures to capture the stability of the oral tract that is characteristic of FPs. We perform classification experiments with Gaussian Mixture Models and various sets of features. We find that Mel-Frequency Cepstral Coefficients perform relatively well in comparison to other features for both FPs and laughter. To address the large variation in the framewise decision scores (e.g., log-likelihood ratios) observed in sequences of frames, we apply a median filter to these scores, which yields large performance improvements. Our analyses and results are presented within the framework of this year’s Interspeech Computational Paralinguistics sub-Challenge on Social Signals.
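The median-filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `median_smooth` and `detect`, the window length of 5 frames, the zero decision threshold, the shrinking window at the sequence edges, and the example scores are all assumptions made for illustration. The input is taken to be one log-likelihood ratio per frame (target-class GMM versus background GMM).

```python
from statistics import median


def median_smooth(scores, window=5):
    """Apply a sliding median to framewise scores.

    `window` is an odd number of frames (assumed value, not from the paper).
    Near the edges the window is truncated rather than padded.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(scores)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        smoothed.append(median(scores[lo:hi]))
    return smoothed


def detect(scores, threshold=0.0, window=5):
    """Return one boolean per frame: True where the smoothed score exceeds the threshold."""
    return [s > threshold for s in median_smooth(scores, window)]


if __name__ == "__main__":
    # Hypothetical framewise log-likelihood ratios with isolated outliers.
    llr = [-2.0, -1.5, 3.0, -1.8, -2.2, 1.2, 1.5, -0.9, 1.8, 2.1, 1.4, -2.0]
    print(median_smooth(llr))
    print(detect(llr))
```

The median is used here rather than a moving average because a single outlying frame cannot shift it, so isolated spikes in the scores are suppressed while longer runs of consistently high or low scores are preserved.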
Similar resources
Speech Enhancement Using Gaussian Mixture Models, Explicit Bayesian Estimation and Wiener Filtering
Gaussian Mixture Models (GMMs) of power spectral densities of speech and noise are used with explicit Bayesian estimations in Wiener filtering of noisy speech. No assumption is made on the nature or stationarity of the noise. No voice activity detection (VAD) or any other means is employed to estimate the input SNR. The GMM mean vectors are used to form sets of over-determined system of equatio...
Speech Enhancement using Laplacian Mixture Model under Signal Presence Uncertainty
In this paper, an estimator for speech enhancement based on a Laplacian Mixture Model is proposed. The proposed method estimates the complex DFT coefficients of clean speech from noisy speech using the MMSE estimator, where the clean speech DFT coefficients are assumed to follow a mixture of Laplacians and the noise DFT coefficients are assumed to follow a zero-mean Gaussian distribution. Furthermore, the MMS...
Laughter detection using ALISP-based N-Gram models
Laughter is a very complex behavior that communicates a wide range of messages with different meanings. It is highly dependent on social and interpersonal attributes. Most previous works (e.g. [1, 2]) on automatic laughter detection from audio use frame-level acoustic features as parameters to train their machine learning techniques, such as Gaussian Mixture Models (GMMs), Support Vecto...
Laughter and filler detection in naturalistic audio
Laughter and fillers are common phenomena in speech and play an important role in communication. In this study, we present Deep Neural Network (DNN) and Convolutional Neural Network (CNN) based systems to classify non-verbal cues (laughter and fillers) from verbal speech in naturalistic audio. We propose improvements over a deep learning system proposed in [1]. Particularly, we propose a simp...
Spotting Social Signals in Conversational Speech over IP: A Deep Learning Perspective
The automatic detection and classification of social signals is an important task, given the fundamental role nonverbal behavioral cues play in human communication. We present the first cross-lingual study on the detection of laughter and fillers in conversational and spontaneous speech collected ‘in the wild’ over IP (internet protocol). Further, this is the first comparison of LSTM and GRU ne...